{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using ColBERT in-memory: Index-Free Encodings & Search\n",
    "\n",
    "Sometimes, building an index doesn't make sense. Maybe you're working with a really small dataset, or one that is really fleeting nature, and will only be relevant to the lifetime of your current instance. In these cases, it can be more efficient to skip all the time-consuming index optimisation, and keep your encodings in-memory to perform ColBERT's magical MaxSim on-the-fly. This doesn't scale very well, but can be very useful in certain settings.\n",
    "\n",
    "In this quick example, we'll use the `RAGPretrainedModel` magic class to demonstrate how to **encode documents in-memory**, before **retrieving them with `search_encoded_docs`**.\n",
    "\n",
    "First, as usual, let's load up a pre-trained ColBERT model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/bclavie/miniforge3/envs/test_rag/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 27, 16:56:39] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/bclavie/miniforge3/envs/test_rag/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from ragatouille import RAGPretrainedModel\n",
    "\n",
    "RAG = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that our model is loaded, we can load and preprocess some data, as in the previous tutorials:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragatouille.utils import get_wikipedia_page\n",
    "from ragatouille.data import CorpusProcessor\n",
    "\n",
    "corpus_processor = CorpusProcessor()\n",
    "\n",
    "documents = [get_wikipedia_page(\"Hayao Miyazaki\"), get_wikipedia_page(\"Studio Ghibli\"), get_wikipedia_page(\"Princess Mononoke\"), get_wikipedia_page(\"Shrek\")]\n",
    "documents = corpus_processor.process_corpus(documents, chunk_size=200)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our documents are now fully ready to be encoded! \n",
    "\n",
    "One important note: `encode()` itself will not split your documents, you must pre-process them yourself (using corpus_processor or your preferred chunking approach). However, `encode()` will dynamically set the maximum token length, calculated based on the token length distribution in your corpus, up to the maximum length supported by the model you're using.\n",
    "\n",
    "Just like normal indexing, `encode()` also supports adding metadata to the encoded documents, which will be returned as part of query results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Encoding 212 documents...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/7 [00:00<?, ?it/s]/Users/bclavie/miniforge3/envs/test_rag/lib/python3.9/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n",
      " 14%|█▍        | 1/7 [00:03<00:21,  3.62s/it]/Users/bclavie/miniforge3/envs/test_rag/lib/python3.9/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n",
      "100%|██████████| 7/7 [00:20<00:00,  2.90s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shapes:\n",
      "encodings: torch.Size([212, 256, 128])\n",
      "doc_masks: torch.Size([212, 256])\n",
      "Documents encoded!\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "RAG.encode([x['content'] for x in documents], document_metadatas=[{\"about\": \"ghibli\"} for _ in range(len(documents))])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'content': 'The studio is also known for its strict \"no-edits\" policy in licensing their films abroad due to Nausicaä of the Valley of the Wind being heavily edited for the film\\'s release in the United States as Warriors of the Wind.\\n\\n\\n=== Independent era ===\\nBetween 1999 and 2005, Studio Ghibli was a subsidiary brand of Tokuma Shoten; however, that partnership ended in April 2005, when Studio Ghibli was spun off from Tokuma Shoten and was re-established as an independent company with relocated headquarters.\\nOn February 1, 2008, Toshio Suzuki stepped down from the position of Studio Ghibli president, which he had held since 2005, and Koji Hoshino (former president of Walt Disney Japan) took over. Suzuki said he wanted to improve films with his own hands as a producer, rather than demanding this from his employees.',\n",
       "  'score': 15.333166122436523,\n",
       "  'rank': 0,\n",
       "  'result_index': 80,\n",
       "  'document_metadata': {'about': 'ghibli'}},\n",
       " {'content': 'Saeko Himuro\\'s novel Umi ga Kikoeru was serialised in the magazine and subsequently adapted into Ocean Waves, Studio Ghibli\\'s first animated feature-length film created for television. It was directed by Tomomi Mochizuki.In October 2001, the Ghibli Museum opened in Mitaka, Tokyo. It contains exhibits based on Studio Ghibli films and shows animations, including a number of short Studio Ghibli films not available elsewhere.\\nThe studio is also known for its strict \"no-edits\" policy in licensing their films abroad due to Nausicaä of the Valley of the Wind being heavily edited for the film\\'s release in the United States as Warriors of the Wind.',\n",
       "  'score': 14.356232643127441,\n",
       "  'rank': 1,\n",
       "  'result_index': 79,\n",
       "  'document_metadata': {'about': 'ghibli'}},\n",
       " {'content': \"Studio Ghibli, Inc. (Japanese: 株式会社スタジオジブリ, Hepburn: Kabushiki-gaisha Sutajio Jiburi) is a Japanese animation studio based in Koganei, Tokyo. It has a strong presence in the animation industry and has expanded its portfolio to include various media formats, such as short subjects, television commercials, and two television films. Their work has been well-received by audiences and recognized with numerous awards. Their mascot and most recognizable symbol, the character Totoro from the 1988 film My Neighbor Totoro, is a giant spirit inspired by raccoon dogs (tanuki) and cats (neko). Among the studio's highest-grossing films are Spirited Away (2001), Howl's Moving Castle (2004), and Ponyo (2008).\",\n",
       "  'score': 12.45059871673584,\n",
       "  'rank': 2,\n",
       "  'result_index': 71,\n",
       "  'document_metadata': {'about': 'ghibli'}}]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "RAG.search_encoded_docs(query = \"What's Gihbli's famous policy?\", k=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that's pretty much it for index-free encoding & querying!\n",
    "\n",
    "But wait, what if your application needs to update dynamically, and accept new documents? Well, that's easy too! A `RAGPretrainedModel` will keep its encoded docs in-memory, and further `encode()` calls will add to it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Encoding 2 documents...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:00<00:00, 10.43it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shapes:\n",
      "encodings: torch.Size([2, 256, 128])\n",
      "doc_masks: torch.Size([2, 256])\n",
      "Documents encoded!\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'content': \"I'm a new document about the importance of Curry! I love curry, it's the best food! Do you like Curry too?\",\n",
       "  'score': 18.96149444580078,\n",
       "  'rank': 0,\n",
       "  'result_index': 212,\n",
       "  'document_metadata': {'about': 'new_document'}}]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_new_document = [\n",
    "    \"I'm a new document about the importance of Curry! I love curry, it's the best food! Do you like Curry too?\",\n",
    "    \"I'm a second new document!\"\n",
    "]\n",
    "RAG.encode(my_new_document, document_metadatas=[{\"about\": \"new_document\"} for _ in range(len(my_new_document))])\n",
    "RAG.search_encoded_docs(query = \"What's the best food?\", k=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if you want to keep your current `RAGPretrainedModel` loaded, but empty the in-memory encodings because the docs are expired and you need to encode new ones? You can do that easily too: just call `clear_encoded_docs()`. By default, this will wait for 10 seconds before deleting everything, but you can pass `force=True` to delete immediately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All in-memory encodings will be deleted in 10 seconds, interrupt now if you want to keep them!\n",
      "...\n"
     ]
    }
   ],
   "source": [
    "RAG.clear_encoded_docs()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can now encode new documents and query them, with no trace of the previous encodings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Encoding 2 documents...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/1 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:00<00:00,  4.49it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shapes:\n",
      "encodings: torch.Size([2, 256, 128])\n",
      "doc_masks: torch.Size([2, 256])\n",
      "Documents encoded!\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "RAG.encode(documents=[\"This a really good document about Ratatouille. Ratatouille is a French dish...\",\n",
    "                      \"This is a document that is absolutely and utterly relevant to anything\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'content': 'This a really good document about Ratatouille. Ratatouille is a French dish...',\n",
       "  'score': 8.764448165893555,\n",
       "  'rank': 0,\n",
       "  'result_index': 0}]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "RAG.search_encoded_docs(query = \"What do you know about dishes? Curry maybe?\", k=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here it is! No trace of our previous, very important document about curry, but we can enjoy some Ratatouille facts instead."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ragatouille",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
